xadupre (Member) commented Sep 1, 2025

### Description

Completes the implementation of MatMul and Gemm for float16 on CPU.

### Motivation and Context

See issue #25824. A benchmark should validate the change, because float32 is usually faster than float16 on CPU. Right now, optimizers insert Cast nodes so the graph can run with the float32 kernels, and that still seems to be the best approach. The benchmark needs to be run on other processors to see when this kernel should be added.
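
For illustration, here is a sketch of that cast pattern, written by hand as a hypothetical equivalent graph (not the optimizer's actual output): the float16 inputs are cast up to float32, the MatMul runs with the float32 kernel, and the result is cast back down.

```python
import onnx
import onnx.helper as oh

# Hypothetical hand-written equivalent of the cast fallback:
# float16 -> Cast(float32) -> MatMul -> Cast(float16).
model = oh.make_model(
    oh.make_graph(
        [
            oh.make_node("Cast", ["X"], ["X32"], to=onnx.TensorProto.FLOAT),
            oh.make_node("Cast", ["Y"], ["Y32"], to=onnx.TensorProto.FLOAT),
            oh.make_node("MatMul", ["X32", "Y32"], ["Z32"]),
            oh.make_node("Cast", ["Z32"], ["Z"], to=onnx.TensorProto.FLOAT16),
        ],
        "cast_fallback",
        [
            oh.make_tensor_value_info("X", onnx.TensorProto.FLOAT16, ["a", "a"]),
            oh.make_tensor_value_info("Y", onnx.TensorProto.FLOAT16, ["a", "a"]),
        ],
        [oh.make_tensor_value_info("Z", onnx.TensorProto.FLOAT16, ["a", "a"])],
    ),
    ir_version=10,
    opset_imports=[oh.make_opsetid("", 18)],
)
```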

| size | float32 (s per iteration) | float16 (s per iteration) |
|------|---------------------------|---------------------------|
| 256  | 0.00023763089993735775    | 0.01750569239993638       |
| 512  | 0.0013353320497117237     | 0.11820917490003921       |
| 1024 | 0.013059543249983107      | 0.9459635703999083        |
<details>
<summary>benchmark code</summary>

```python
import time

import numpy as np
import onnx
import onnx.helper as oh
import onnxruntime


def model_type(itype):
    # A single MatMul on square inputs sharing the dynamic dimension "a".
    return oh.make_model(
        oh.make_graph(
            [oh.make_node("MatMul", ["X", "Y"], ["Z"])],
            "b",
            [
                oh.make_tensor_value_info("X", itype, ["a", "a"]),
                oh.make_tensor_value_info("Y", itype, ["a", "a"]),
            ],
            [oh.make_tensor_value_info("Z", itype, ["a", "a"])],
        ),
        ir_version=10,
        opset_imports=[oh.make_opsetid("", 18)],
    )


sess16 = onnxruntime.InferenceSession(
    model_type(onnx.TensorProto.FLOAT16).SerializeToString(),
    providers=["CPUExecutionProvider"],
)
sess32 = onnxruntime.InferenceSession(
    model_type(onnx.TensorProto.FLOAT).SerializeToString(),
    providers=["CPUExecutionProvider"],
)

N = 20
for size in [256, 512, 1024]:

    # float32

    f32 = dict(
        X=np.random.randn(size, size).astype(np.float32),
        Y=np.random.randn(size, size).astype(np.float32),
    )
    # warmup
    for i in range(10):
        sess32.run(None, f32)

    # measure
    begin = time.perf_counter()
    for i in range(N):
        sess32.run(None, f32)
    duration = time.perf_counter() - begin
    print(f"float32 size={size}, time={duration / N}s per iteration")

    # float16

    f16 = {k: v.astype(np.float16) for k, v in f32.items()}
    # warmup
    for i in range(10):
        sess16.run(None, f16)

    # measure
    begin = time.perf_counter()
    for i in range(N):
        sess16.run(None, f16)
    duration = time.perf_counter() - begin
    print(f"float16 size={size}, time={duration / N}s per iteration")
```

</details>
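
The benchmark above only exercises MatMul. A minimal smoke test for the Gemm path could look like the sketch below (a hypothetical check, not part of this PR; the reference is computed in float32 and the tolerances are loose because float16 accumulation differs between implementations):

```python
import numpy as np
import onnx
import onnx.helper as oh
import onnxruntime

# Single Gemm node, all inputs and the output in float16.
model = oh.make_model(
    oh.make_graph(
        [oh.make_node("Gemm", ["A", "B", "C"], ["Y"])],
        "g",
        [
            oh.make_tensor_value_info("A", onnx.TensorProto.FLOAT16, [4, 8]),
            oh.make_tensor_value_info("B", onnx.TensorProto.FLOAT16, [8, 3]),
            oh.make_tensor_value_info("C", onnx.TensorProto.FLOAT16, [4, 3]),
        ],
        [oh.make_tensor_value_info("Y", onnx.TensorProto.FLOAT16, [4, 3])],
    ),
    ir_version=10,
    opset_imports=[oh.make_opsetid("", 18)],
)

sess = onnxruntime.InferenceSession(
    model.SerializeToString(), providers=["CPUExecutionProvider"]
)
feeds = {
    "A": np.random.randn(4, 8).astype(np.float16),
    "B": np.random.randn(8, 3).astype(np.float16),
    "C": np.random.randn(4, 3).astype(np.float16),
}
(got,) = sess.run(None, feeds)
# Reference computed in float32, then cast back to float16.
expected = (
    feeds["A"].astype(np.float32) @ feeds["B"].astype(np.float32)
    + feeds["C"].astype(np.float32)
).astype(np.float16)
np.testing.assert_allclose(got, expected, rtol=5e-2, atol=5e-2)
```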
